13 research outputs found
Multi-Speaker Multi-Lingual VQTTS System for LIMMITS 2023 Challenge
In this paper, we describe the systems developed by the SJTU X-LANCE team for
LIMMITS 2023 Challenge, and we mainly focus on the winning system on
naturalness for track 1. The aim of this challenge is to build a multi-speaker
multi-lingual text-to-speech (TTS) system for Marathi, Hindi and Telugu. Each
of the languages has a male and a female speaker in the given dataset. In track
1, only 5 hours of data from each speaker can be selected to train the TTS model.
Our system is based on the recently proposed VQTTS that utilizes VQ acoustic
feature rather than mel-spectrogram. We introduce additional speaker embeddings
and language embeddings to VQTTS for controlling the speaker and language
information. In the cross-lingual evaluations where we need to synthesize
speech in a cross-lingual speaker's voice, we provide a native speaker's
embedding to the acoustic model and the target speaker's embedding to the
vocoder. In the subjective MOS listening test on naturalness, our system
achieves 4.77, which ranks first.
Comment: Accepted by ICASSP 2023 Special Session for Grand Challenge
EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance
Although current neural text-to-speech (TTS) models are able to generate
high-quality speech, intensity controllable emotional TTS is still a
challenging task. Most existing methods need external optimizations for
intensity calculation, leading to suboptimal results or degraded quality. In
this paper, we propose EmoDiff, a diffusion-based TTS model where emotion
intensity can be manipulated by a proposed soft-label guidance technique
derived from classifier guidance. Specifically, instead of being guided with a
one-hot vector for the specified emotion, EmoDiff is guided with a soft label
where the value of the specified emotion and Neutral is set to $\alpha$
and $1-\alpha$ respectively. The $\alpha$ here represents the emotion
intensity and can be chosen from 0 to 1. Our experiments show that EmoDiff can
precisely control the emotion intensity while maintaining high voice quality.
Moreover, diverse speech with specified emotion intensity can be generated by
sampling in the reverse denoising process.
Comment: Accepted to ICASSP202
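The soft-label idea in the EmoDiff abstract can be illustrated with a small sketch. It assumes the label puts the chosen intensity on the specified emotion and the complement on Neutral; the function name `soft_label`, the emotion list, and the handling of a Neutral target are hypothetical and not taken from the paper.

```python
def soft_label(emotions, target, intensity, neutral="Neutral"):
    """Build a soft emotion label (illustrative only, not the EmoDiff code).

    The target emotion receives `intensity`, Neutral receives the
    remaining mass, and all other emotions stay at zero.
    """
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    if target not in emotions or neutral not in emotions:
        raise ValueError("target and neutral must both be in the emotion list")
    label = {name: 0.0 for name in emotions}
    label[target] = intensity
    # Adding (rather than assigning) keeps the label summing to 1
    # even when the target itself is Neutral.
    label[neutral] += 1.0 - intensity
    return [label[name] for name in emotions]


emotions = ["Neutral", "Happy", "Sad"]
print(soft_label(emotions, "Happy", 0.25))
```

An intensity of 1 reduces this to the usual one-hot label, and an intensity of 0 gives a purely Neutral label.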
VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature
The mainstream neural text-to-speech (TTS) pipeline is a cascade system,
including an acoustic model (AM) that predicts acoustic feature from the input
transcript and a vocoder that generates waveform according to the given
acoustic feature. However, the acoustic feature in current TTS systems is
typically mel-spectrogram, which is highly correlated along both time and
frequency axes in a complicated way, leading to a great difficulty for the AM
to predict. Although high-fidelity audio can be generated by recent neural
vocoders from ground-truth (GT) mel-spectrogram, the gap between the GT and the
predicted mel-spectrogram from AM degrades the performance of the entire TTS
system. In this work, we propose VQTTS, consisting of an AM txt2vec and a
vocoder vec2wav, which uses self-supervised vector-quantized (VQ) acoustic
feature rather than mel-spectrogram. We redesign both the AM and the vocoder
accordingly. In particular, txt2vec basically becomes a classification model
instead of a traditional regression model while vec2wav uses an additional
feature encoder before HifiGAN generator for smoothing the discontinuous
quantized feature. Our experiments show that vec2wav achieves better
reconstruction performance than HifiGAN when using self-supervised VQ acoustic
feature. Moreover, our entire TTS system VQTTS achieves state-of-the-art
performance in terms of naturalness among all current publicly available TTS
systems.
Comment: This version has been removed by arXiv administrators because the
submitter did not have the authority to assign the license at the time of
submission
VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching
Although diffusion models in text-to-speech have become a popular choice due
to their strong generative ability, the intrinsic complexity of sampling from
diffusion models harms their efficiency. Alternatively, we propose VoiceFlow,
an acoustic model that utilizes a rectified flow matching algorithm to achieve
high synthesis quality with a limited number of sampling steps. VoiceFlow
formulates the process of generating mel-spectrograms into an ordinary
differential equation conditional on text inputs, whose vector field is then
estimated. The rectified flow technique then effectively straightens its
sampling trajectory for efficient synthesis. Subjective and objective
evaluations on both single and multi-speaker corpora showed the superior
synthesis quality of VoiceFlow compared to the diffusion counterpart. Ablation
studies further verified the validity of the rectified flow technique in
VoiceFlow.
Comment: 4 figures, 5 pages, submitted to ICASSP 202
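The VoiceFlow abstract's point about straightened trajectories can be seen in a toy setting. The sketch below is not VoiceFlow itself: it replaces the learned, text-conditioned vector field over mel-spectrograms with two hand-written scalar fields (`straight_field` and `curved_field`, both hypothetical) and integrates them from t = 0 to t = 1 with a plain Euler solver.

```python
def euler_sample(vector_field, x0, num_steps):
    """Integrate dx/dt = vector_field(x, t) from t=0 to t=1 with Euler steps."""
    x = x0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * vector_field(x, t)
    return x


def straight_field(x, t):
    # Constant velocity: the exact path is x(t) = 2t, a straight line to 2.
    return 2.0


def curved_field(x, t):
    # Time-varying velocity: the exact path is x(t) = 2t^2, also ending at 2.
    return 4.0 * t


for steps in (1, 10, 100):
    print(steps,
          euler_sample(straight_field, 0.0, steps),
          euler_sample(curved_field, 0.0, steps))
```

Both fields transport 0 to 2 when integrated exactly, but only the straight one gets there in a single Euler step. The curved one needs many steps to come close, which is the sampling cost that a straighter trajectory is meant to avoid.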
Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS
Self-supervised learning (SSL) proficiency in speech-related tasks has driven
research into utilizing discrete tokens for speech tasks like recognition and
translation, which offer lower storage requirements and great potential to
employ natural language processing techniques. However, these studies, mainly
single-task focused, faced challenges like overfitting and performance
degradation in speech recognition tasks, often at the cost of sacrificing
performance in multi-task scenarios. This study presents a comprehensive
comparison and optimization of discrete tokens generated by various leading SSL
models in speech recognition and synthesis tasks. We aim to explore the
universality of speech discrete tokens across multiple speech tasks.
Experimental results demonstrate that discrete tokens achieve comparable
results against systems trained on FBank features in speech recognition tasks
and outperform mel-spectrogram features in speech synthesis in subjective and
objective metrics. These findings suggest that universal discrete tokens have
enormous potential in various speech-related tasks. Our work is open-source and
publicly available to facilitate research in this direction.
UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding
The utilization of discrete speech tokens, divided into semantic tokens and
acoustic tokens, has been proven superior to traditional acoustic feature
mel-spectrograms in terms of naturalness and robustness for text-to-speech
(TTS) synthesis. Recent popular models, such as VALL-E and SPEAR-TTS, allow
zero-shot speaker adaptation through auto-regressive (AR) continuation of
acoustic tokens extracted from a short speech prompt. However, these AR models
are restricted to generate speech only in a left-to-right direction, making
them unsuitable for speech editing where both preceding and following contexts
are provided. Furthermore, these models rely on acoustic tokens, which have
audio quality limitations imposed by the performance of audio codec models. In
this study, we propose a unified context-aware TTS framework called UniCATS,
which is capable of both speech continuation and editing. UniCATS comprises two
components, an acoustic model CTX-txt2vec and a vocoder CTX-vec2wav.
CTX-txt2vec employs contextual VQ-diffusion to predict semantic tokens from the
input text, enabling it to incorporate the semantic context and maintain
seamless concatenation with the surrounding context. Following that,
CTX-vec2wav utilizes contextual vocoding to convert these semantic tokens into
waveforms, taking into consideration the acoustic context. Our experimental
results demonstrate that CTX-vec2wav outperforms HifiGAN and AudioLM in terms
of speech resynthesis from semantic tokens. Moreover, we show that UniCATS
achieves state-of-the-art performance in both speech continuation and editing.
Acoustic Word Embeddings for End-to-End Speech Synthesis
The most recent end-to-end speech synthesis systems use phonemes as acoustic input tokens and ignore the information about which word the phonemes come from. However, many words have their own specific prosody type, which may significantly affect naturalness. Prior works have employed pre-trained linguistic word embeddings as TTS system input. However, since linguistic information is not directly relevant to how words are pronounced, the TTS quality improvement of these systems is mild. In this paper, we propose a novel and effective way of jointly training acoustic phone and word embeddings for end-to-end TTS systems. Experiments on the LJSpeech dataset show that the acoustic word embeddings dramatically decrease both the training and validation loss in phone-level prosody prediction. Subjective evaluations on naturalness demonstrate that incorporating acoustic word embeddings can significantly outperform both the pure phone-based system and the TTS system with pre-trained linguistic word embeddings.
Research on Seismic Performance and Reinforcement Methods for Self-Centering Rocking Steel Bridge Piers
To study the seismic performance of self-centering circular-section rocking steel bridge piers whose functions can be restored after an earthquake, a high-precision finite element (FE) analysis model of such a bridge pier was established. The hysteresis behavior of concrete-infilled and hollow rocking steel bridge piers was compared. In response to the local deformation of the wall plates and the elliptical deformation of the bottom surface, two reinforcement methods for the pier bottom were proposed: thickening the wall plate, and adding longitudinal stiffeners in the plastic zone of the pier bottom. A pseudo-static analysis of the bridge piers was carried out considering the effects of the overall design parameters and the reinforcement parameters of the pier bottom. The results indicate that the FE model used in this paper can obtain accurate horizontal load-displacement curves of rocking steel bridge piers. The hysteresis curves of the hollow and concrete-infilled rocking steel bridge piers are close, so directly using hollow steel bridge piers can improve the economic efficiency of the design. Compared with adding longitudinal stiffeners, thickening the wall plates at the pier bottom is more effective in improving the seismic performance of bridge piers. Reinforcing the pier bottom has little effect on the energy dissipation capacity of the bridge pier, but it helps to reduce residual displacement and improve lateral stiffness.
Fractional viscoelastic solution of stratum displacement of a shallow tunnel under the surface slope condition
The unified displacement function (UDF) is presented to describe the time-dependent deformation behaviour of the tunnel profile under the surface slope condition. Based on the discrete Fourier method, the third-order UDF in the physical plane is expanded into a Laurent series in the complex variable plane. The complex variable method is employed to derive the elastic analytical solution of stratum displacement when the third-order UDF is taken as the displacement boundary condition of tunnel cross-section (DBCTC). The proposed elastic solution agrees well with the results of the finite element method for the same model, which verifies the correctness of the proposed analytical solution. Combining the correspondence principle and the fractional Generalized Kelvin viscoelastic constitutive model, the fractional viscoelastic solution under the surface slope condition is determined. The time effect of stratum displacement is presented in two aspects: time-dependent DBCTC and time-dependent material parameters. A parameter analysis is performed to investigate the influences of the deformation modes of the third-order UDF, slope angle, tunnel radius and fractional order on the time effect of stratum vertical and horizontal displacement.
Effects of injection pressure on cavitation and spray in marine diesel engine
Numerical simulation of the cavitation and spray in a marine diesel engine is performed to investigate the effects of injection pressure on the cavitation flow and spray characteristics, which in turn influence atomization and combustion in the cylinder. A two-phase flow model combined with single bubble dynamics and a droplet break-up model are used to simulate cavitation and spray, respectively, and the results are compared to the experimental data. With increasing injection pressure, the pressure fluctuations inside the nozzle become more intense. The spray penetration is proportional to time at the beginning of injection. Higher injection pressure increases the spray angle. In addition, massive structures on the spray edge can return to the spray body, whereas the massive structures on the spray head remain unchanged throughout their lifetime. Each additional 20 MPa of injection pressure reduces the Sauter mean diameter by approximately 9%.
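As a rough worked example of the last figure in the spray abstract, the sketch below applies the roughly 9% reduction once per 20 MPa of added injection pressure. The abstract does not say whether successive reductions compound, so the multiplicative form here is an assumption, and the function name `smd_after_increase` and the baseline value of 100 (in arbitrary length units) are hypothetical.

```python
def smd_after_increase(baseline_smd, pressure_increase_mpa,
                       step_mpa=20.0, reduction_per_step=0.09):
    """Estimate the Sauter mean diameter after raising injection pressure.

    Assumes (not stated in the abstract) that each `step_mpa` increment
    multiplies the diameter by (1 - reduction_per_step).
    """
    if pressure_increase_mpa < 0:
        raise ValueError("pressure increase must be non-negative")
    steps = pressure_increase_mpa / step_mpa
    return baseline_smd * (1.0 - reduction_per_step) ** steps


for delta in (0, 20, 40, 60):
    print(delta, round(smd_after_increase(100.0, delta), 2))
```

Under this assumption, two 20 MPa increments leave about 83% of the baseline diameter rather than 82%, because the second 9% is taken from an already reduced value.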